Fix WAGED to only use logicalId when computing baseline and centralize picking assignable instances in the cache #2702

zpinto · 2023-11-21T00:54:42Z

Issues

We found that significant shuffling occurred when performing a Node Swap operation on WAGED clusters. #2662

Description

After investigating, we determined that the cause was the PartitionMovementConstraint and BaselineInfluenceConstraint soft constraints improperly scoring AssignableReplicas during the completion of a SWAP.

This was because those two soft constraints were using InstanceName instead of LogicalId to determine whether the SWAP_IN node was in the baseline or best possible state. It would not be, because the last time the algorithm produced the baseline and best possible state, they contained the SWAP_OUT node instead. This lowers the score and leads to incorrect assignment.

To fix this, those soft constraints now look for LogicalId, and the baseline and best possible state are passed in with instanceName replaced by logicalId.
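For illustration, here is a minimal sketch of why matching on instanceName misses the SWAP_IN node while matching on logicalId does not. It is not the Helix implementation: the class name, the `toLogicalIdKeys` helper, and the host and logicalId values are all hypothetical, and the baseline is reduced to a single partition's instance-to-state map.

```java
import java.util.HashMap;
import java.util.Map;

public class LogicalIdBaselineSketch {
  /**
   * Rewrites a partition's {instanceName -> state} map so it is keyed by logicalId.
   * Instances without a known logicalId fall back to their instance name.
   * (Hypothetical helper, not a Helix API.)
   */
  static Map<String, String> toLogicalIdKeys(Map<String, String> stateByInstance,
      Map<String, String> logicalIdByInstance) {
    Map<String, String> stateByLogicalId = new HashMap<>();
    for (Map.Entry<String, String> entry : stateByInstance.entrySet()) {
      String logicalId = logicalIdByInstance.getOrDefault(entry.getKey(), entry.getKey());
      stateByLogicalId.put(logicalId, entry.getValue());
    }
    return stateByLogicalId;
  }

  public static void main(String[] args) {
    // Baseline computed before the swap completed: the SWAP_OUT node holds the replica.
    Map<String, String> baseline = Map.of("host-old_12918", "MASTER");
    // Both nodes of the swap pair share one logicalId (made-up values).
    Map<String, String> logicalIds =
        Map.of("host-old_12918", "rack1-slot3", "host-new_12918", "rack1-slot3");

    String swapIn = "host-new_12918";
    boolean matchByName = baseline.containsKey(swapIn);
    boolean matchByLogicalId =
        toLogicalIdKeys(baseline, logicalIds).containsKey(logicalIds.get(swapIn));

    System.out.println("match by instanceName: " + matchByName);      // false
    System.out.println("match by logicalId: " + matchByLogicalId);    // true
  }
}
```

With the name-based lookup the SWAP_IN node looks like a brand-new placement, which is the situation the description attributes the lower score to.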

Also, we are putting all of the logic for which instances are assignable in one place, the BaseControllerDataProvider. We will also add Evacuation here in the future.
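As a rough sketch of what centralizing this decision could look like, the example below filters a node list down to the assignable ones. It assumes that a SWAP_IN node is not assignable while another node still claims the same logicalId; the `Node` class and `assignableInstances` method are hypothetical stand-ins, not the BaseControllerDataProvider API.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class AssignableInstancesSketch {
  /** Hypothetical, simplified stand-in for an instance config. */
  static final class Node {
    final String name;
    final String logicalId;
    final boolean swapIn;

    Node(String name, String logicalId, boolean swapIn) {
      this.name = name;
      this.logicalId = logicalId;
      this.swapIn = swapIn;
    }
  }

  /**
   * Returns the names of nodes that may receive assignments. A SWAP_IN node is
   * excluded while another (non-SWAP_IN) node claims the same logicalId.
   */
  static List<String> assignableInstances(List<Node> nodes) {
    // logicalIds claimed by nodes that are not swapping in
    Set<String> claimed = new HashSet<>();
    for (Node node : nodes) {
      if (!node.swapIn) {
        claimed.add(node.logicalId);
      }
    }
    List<String> assignable = new ArrayList<>();
    for (Node node : nodes) {
      if (node.swapIn && claimed.contains(node.logicalId)) {
        continue; // swap partner still present, so skip the SWAP_IN node
      }
      assignable.add(node.name);
    }
    return assignable;
  }

  public static void main(String[] args) {
    List<Node> nodes = Arrays.asList(
        new Node("host-a", "slot1", false),
        new Node("host-old", "slot2", false),
        new Node("host-new", "slot2", true));
    System.out.println(assignableInstances(nodes)); // [host-a, host-old]
  }
}
```

Keeping the rule in one method means callers ask for "assignable" instances instead of each re-deriving the swap logic.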

Tests

The previous integration tests have been modified to cover the changes.

We have also tested this logic on a production cluster copied to a testing environment.

Changes that Break Backward Compatibility (Optional)

NA

Despite adding new getters to BaseControllerDataProvider, the old signatures are kept with the same behavior.

Commits

My commits all reference appropriate Apache Helix GitHub issues in their subject lines. In addition, my commits follow the guidelines from "How to write a good git commit message":
1. Subject is separated from body by a blank line
2. Subject is limited to 50 characters (not including Jira issue reference)
3. Subject does not end with a period
4. Subject uses the imperative mood ("add", not "adding")
5. Body wraps at 72 characters
6. Body explains "what" and "why", not "how"

Code Quality

My diff has been formatted using helix-style.xml
(helix-style-intellij.xml if IntelliJ IDE is used)


xyuanlu

Thanks for making the change. One general question: for spectator and EV, I think there are places where we still use getInstanceConfigMap. Will this include both swap-in and swap-out instances?

helix-core/src/main/java/org/apache/helix/model/ResourceAssignment.java

...core/src/main/java/org/apache/helix/controller/dataproviders/BaseControllerDataProvider.java

helix-core/src/main/java/org/apache/helix/controller/changedetector/ResourceChangeSnapshot.java

...core/src/main/java/org/apache/helix/controller/dataproviders/BaseControllerDataProvider.java


xyuanlu

One more question:
When we throttle rebalance with ERROR-state replicas, we count all replicas on both the swap-in and swap-out instances. Do we need to ignore replicas on the swap-in instance?
This is in IntermediateStateCalcStage.

helix-core/src/main/java/org/apache/helix/util/HelixUtil.java

xyuanlu · 2023-11-22T18:20:58Z

helix-core/src/main/java/org/apache/helix/controller/stages/BestPossibleStateCalcStage.java

@@ -239,7 +290,8 @@ private boolean validateOfflineInstancesLimit(final ResourceControllerDataProvid
      final HelixManager manager) {
    int maxOfflineInstancesAllowed = cache.getClusterConfig().getMaxOfflineInstancesAllowed();
    if (maxOfflineInstancesAllowed >= 0) {
-      int offlineCount = cache.getAllInstances().size() - cache.getEnabledLiveInstances().size();
+      int offlineCount =
+          cache.getAssignableInstances().size() - cache.getAssignableEnabledLiveInstances().size();


This might be tricky for Evacuate. For swap, the swap-in instance mirrors the swap-out instance, so it is not counted toward the offline limit for EMM. Evacuate, however, reduces capacity, so we may want to handle these two cases separately.
Not related to this change, though.
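To make the effect of the diff above concrete, here is a small worked example of the offline-count arithmetic. It is a sketch under stated assumptions: the node names are made up, the swap-in node is assumed to be registered but not yet enabled and live, and the assignable sets are assumed to exclude it.

```java
import java.util.Set;

public class OfflineCountSketch {
  /**
   * Same arithmetic as the diff: size difference between an instance set and
   * its enabled-live subset (assumes enabledLive is a subset of instances).
   */
  static int offlineCount(Set<String> instances, Set<String> enabledLive) {
    return instances.size() - enabledLive.size();
  }

  public static void main(String[] args) {
    // Before: all instances, including the not-yet-live swap-in node.
    Set<String> allInstances = Set.of("n1", "n2", "n3", "n3-swap-in");
    Set<String> enabledLive = Set.of("n1", "n2", "n3");

    // After: only assignable instances are counted on both sides.
    Set<String> assignable = Set.of("n1", "n2", "n3");
    Set<String> assignableEnabledLive = Set.of("n1", "n2", "n3");

    System.out.println("old offlineCount = " + offlineCount(allInstances, enabledLive));          // 1
    System.out.println("new offlineCount = " + offlineCount(assignable, assignableEnabledLive));  // 0
  }
}
```

Under these assumptions the mirrored swap-in node no longer counts against maxOfflineInstancesAllowed. An evacuating node would need separate treatment, as the comment above notes.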

helix-core/src/main/java/org/apache/helix/controller/rebalancer/DelayedAutoRebalancer.java


zpinto · 2023-11-22T20:03:46Z

Tests in the previous run failed due to known flaky tests:

Test failed: testGetAllInstances(org.apache.helix.rest.server.TestInstancesAccessor) #2667 (there is a PR with a fix for this one)
Test failed: testCacheDataUpdates(org.apache.helix.metaclient.impl.zk.TestZkMetaClientCache) #2693

I was able to pass both of these tests locally.

helix-core/src/main/java/org/apache/helix/controller/stages/BestPossibleStateCalcStage.java

zpinto · 2023-11-22T22:11:26Z

Thanks for making the change. One general question: for spectator and EV, I think there are places where we still use getInstanceConfigMap. Will this include both swap-in and swap-out instances?

Will be addressing this in a future PR for spectator.

zpinto · 2023-11-22T22:23:28Z

Will add a comment with a TODO to address 2->3 ST messages being throttled and not sent, because the additional messages sent to the SWAP_IN node count against the throttles.

This is already a known issue, and we should figure out a way to prioritize ST messages.

xyuanlu · 2023-11-22T22:24:09Z

Had an offline review session and discussed several points. Generally looking good. Thanks for working on this!

…out hosts it in top or second top state, add throttling todos, and move assignable maps in cache out of propert cache block.

…eLogicalIds with stream instead of parallelStream to prevent concurrent modification exception.

… the same cluster as TestPerInstanceAccessor which adds an evacuate instance to the cluster. Any time TestPerInstanceAccessor runs first, it will cause TestInstanceAccessor.getAllInstances to fail. Using a new cluster for TestInstanceAccessor fixes the issue.

zpinto · 2023-11-28T08:52:50Z

I was able to fix the TestInstancesAccessor testGetAllInstances test.

Test failed: testCacheDataUpdates(org.apache.helix.metaclient.impl.zk.TestZkMetaClientCache) #2693 which is now fixed by #2705

Assuming that review looks good, this PR is ready to be merged.

Final Commit Message:
Fix WAGED to only use logicalId when computing baseline and centralize picking assignable instances in the cache.

zpinto · 2023-11-28T17:10:13Z

Thank you so much for the review @xyuanlu!

This PR is ready to be merged.

Final Commit Message:
Fix WAGED to only use logicalId when computing baseline and centralize picking assignable instances in the cache.

…e picking assignable instances in the cache. (#2702) Fix WAGED to only use logicalId when computing baseline and centralize picking assignable instances in the cache.

zpinto added 2 commits November 20, 2023 16:52

Fix WAGED to only use logicalId when computing baseline and centraliz…

13a9809

…e picking assignable instances in the cache.

Remove unnecessary debug logging and refactor add to bestPossibleStat…

a3ef1eb

…eOutput into method.

xyuanlu reviewed Nov 21, 2023

helix-core/src/main/java/org/apache/helix/model/ResourceAssignment.java (outdated, resolved)

xyuanlu reviewed Nov 21, 2023


zpinto added 3 commits November 21, 2023 18:06

Address all failing test cases and some fixes for bugs in initial PR.

b517fa0

Fix final set of failing tests and make logic to determine which repl…

cb1c83b

…icas need to be assigned based off of logicalId instead of instanceName.

Update the methods in ResourceChangeSnapshot.

23b9cdc

zpinto marked this pull request as ready for review November 22, 2023 16:48

xyuanlu reviewed Nov 22, 2023

helix-core/src/main/java/org/apache/helix/controller/rebalancer/DelayedAutoRebalancer.java (outdated, resolved)

Fix naming and log statements to make it clear that assignable nodes …

5dfc9da

…are used.

xyuanlu reviewed Nov 22, 2023

helix-core/src/main/java/org/apache/helix/controller/stages/BestPossibleStateCalcStage.java (outdated, resolved)

zpinto added 3 commits November 26, 2023 20:17

Incorporate feedback to only have replica put on swap-in if the swap-…

a89eefb

…out hosts it in top or second top state, add throttling todos, and move assignable maps in cache out of propert cache block.

Replace iterating over live instances to create assignableLiveInstanc…

4a78c3c

…eLogicalIds with stream instead of parallelStream to prevent concurrent modification exception.
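The code changed by this commit is not shown in the thread, so the following is only a sketch of the general pattern. It assumes the hazard was a shared, non-thread-safe collection being mutated from a parallelStream, and shows the sequential alternative. The method name and inputs are hypothetical.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.stream.Collectors;

public class SequentialCollectSketch {
  /**
   * Builds the set of logicalIds for the given live instances.
   * Instances without a known logicalId fall back to their own name.
   */
  static Set<String> liveLogicalIds(List<String> liveInstances,
      Map<String, String> logicalIdByInstance) {
    // Pattern being avoided (assumed):
    //   liveInstances.parallelStream().forEach(i -> sharedHashSet.add(...));
    // which mutates a non-thread-safe set from several threads at once.
    return liveInstances.stream() // sequential: one thread builds the result
        .map(instance -> logicalIdByInstance.getOrDefault(instance, instance))
        .collect(Collectors.toSet());
  }

  public static void main(String[] args) {
    Set<String> ids = liveLogicalIds(
        Arrays.asList("host-old", "host-new", "host-a"),
        Map.of("host-old", "slot2", "host-new", "slot2"));
    System.out.println(ids.size()); // 2: "slot2" and "host-a"
  }
}
```

Collecting with `Collectors.toSet()` also avoids the shared mutable set entirely, so the method stays safe if it is later parallelized.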

xyuanlu approved these changes Nov 28, 2023


xyuanlu merged commit 310b946 into apache:ApplicationClusterManager Nov 28, 2023
1 of 2 checks passed
